
Insertions and the emergence of novel protein structure: a structure-based phylogenetic study of insertions

Author: A Sali
AR Panchenko
AR Panchenko
BG Hall
C Blouin
C Chothia
C Notredame
Christian Blouin
D Frishman
EI Petersen
EV Koonin
GP Karev
Haiyan Jiang
I Van Walle
IN Shindyalov
J Casbon
J Felsentein
JM Chandonia
K Mizuguchi
L Aravind
L Holm
L Ribas De Pouplana
M Clamp
M Heinig
M Shatsky
MB Eisen
N Saitou
NV Dokholyan
NV Grishin
O O'Sullivan
O Poirot
P O'Donoghue
P O'Donoghue
R Breitling
R Development Core Team
RB Russell
S Balaji
S Guindon
S Guindon
S Pascarella
SA Benner
UG Wagner
W Humphrey
WL DeLano
WR Taylor
Y Wolf
Y Ye
Y Ye
ZY Zhu
Publication venue: BioMed Central
Publication date: 01/01/2007

Abstract. Background: In protein evolution, the mechanism by which novel protein domains emerge is still an open question. The incremental growth of protein variable regions, produced by stochastic insertions, has the potential to generate large and complex sub-structures. In this study, a deterministic methodology is proposed to reconstruct phylogenies from protein structures and to infer insertion events in protein evolution. The analysis was performed on a broad range of SCOP domain families. Results: Phylogenies were reconstructed from protein 3D structural data. The phylogenetic trees were used to infer ancestral structures with a consensus method. From these ancestral reconstructions, 42.7% of the observed insertions are nested insertions, which are located in previous insert regions. The average size of inserts tends to increase with the insert rank or total number of insertions in the variable regions. We found that the structures of some nested inserts show complex or even domain-like fold patterns with helices, strands and loops. Furthermore, a basal level of structural innovation was found in inserts which displayed a significant structural similarity exclusively to themselves. The β-Lactamase/D-ala carboxypeptidase domain family is provided as an example to illustrate the inference of insertion events and how the incremental growth of a variable region is capable of generating novel structural patterns. Conclusion: Using 3D data, we proposed a method to reconstruct phylogenies. We applied the method to reconstruct the sequences of insertion events leading to the emergence of potentially novel structural elements within existing protein domains. The results suggest that structural innovation is possible via the stochastic process of insertions and rapid evolution within variable regions where inserts tend to be nested. We also demonstrate that the structure-based phylogeny enables the study of new questions relating to the evolution of protein domains and biological function.

BindN+ for accurate prediction of DNA and RNA-binding residues from protein sequence features

Author: AP Bradley
AR Panchenko
C Yan
Caiyan Huang
CH Wu
DE Draper
E Bechara
IB Kuznetsov
JA Swets
Jack Y Yang
JC Darnell
L Wang
L Wang
Liangjiang Wang
M Terribilini
Mary Qu Yang
P Baldi
S Ahmad
S Ahmad
S Hwang
S Jones
SF Altschul
T Joachims
WS Noble
Publication venue: BioMed Central
Publication date: 01/01/2010

Abstract. Background: Understanding how biomolecules interact is a major task of systems biology. To model protein-nucleic acid interactions, it is important to identify the DNA- or RNA-binding residues in proteins. Protein sequence features, including the biochemical properties of amino acids and evolutionary information in the form of a position-specific scoring matrix (PSSM), have been used for DNA- or RNA-binding site prediction. However, the PSSM is designed primarily for PSI-BLAST searches, and it may not contain all the evolutionary information needed for modelling DNA- or RNA-binding sites in protein sequences. Results: In the present study, several new descriptors of evolutionary information have been developed and evaluated for sequence-based prediction of DNA- and RNA-binding residues using support vector machines (SVMs). The new descriptors were shown to improve classifier performance. Interestingly, the best classifiers were obtained by combining the new descriptors and PSSM, suggesting that they captured different aspects of evolutionary information for DNA- and RNA-binding site prediction. The SVM classifiers achieved 77.3% sensitivity and 79.3% specificity for prediction of DNA-binding residues, and 71.6% sensitivity and 78.7% specificity for RNA-binding site prediction. Conclusions: Predictions at this level of accuracy may provide useful information for modelling protein-nucleic acid interactions in systems biology studies. We have thus developed a web-based tool called BindN+ (http://bioinfo.ggc.org/bindn+/) to make the SVM classifiers accessible to the research community.

IUPUIScholarWorks

eScholarship - University of California

The Energy Landscapes of Repeat-Containing Proteins: Topology, Cooperativity, and the Folding Funnels of One-Dimensional Architectures

Author: A Kohl
AK Bjorklund
Aleksandra M. Walczak
AR Panchenko
AR Panchenko
BH Zimm
C Clementi
C Clementi
CC Mello
CM Bradley
CM Bradley
CY Cheng
D Barrick
Diego U. Ferreiro
DK Klimov
DN Ivankov
DU Ferreiro
DU Ferreiro
DU Ferreiro
DU Ferreiro
E Kloss
Elizabeth A. Komives
ER Main
ER Main
H Frauenfelder
JA Schellman
JD Bryngelson
JD Bryngelson
JK Myers
KS Tang
KW Tripp
LD D'Andrea
LL Chavez
M Oliveberg
Matthew P. Jacobson
ME Zweifel
MJ Cliff
MR Ejtehadi
N Koga
ND Werbeck
O Weiss
Peter G. Wolynes
RN Venkataramani
RP Feynman
S Yang
SK Wetzel
SM Truhlar
SS Cho
T Kajander
TO Street
V Munoz
W Humphrey
Y Levy
Z Luthey-Schulten
Publication venue: Public Library of Science
Publication date: 01/05/2008

Repeat-proteins are made up of near repetitions of 20- to 40-amino-acid stretches. These polypeptides usually fold up into non-globular, elongated architectures that are stabilized by the interactions within each repeat and those between adjacent repeats, but that lack contacts between residues distant in sequence. The inherent symmetries both in primary sequence and three-dimensional structure are reflected in a folding landscape that may be analyzed as a quasi-one-dimensional problem. We present a general description of repeat-protein energy landscapes based on a formal Ising-like treatment of the elementary interaction energetics in and between foldons, whose collective ensemble is treated as spin variables. The overall folding properties of a complete “domain” (the stability and cooperativity of the repeating array) can be derived from this microscopic description. The one-dimensional nature of the model implies there are simple relations for the experimental observables: folding free-energy (ΔGwater) and the cooperativity of denaturation (m-value), which do not ordinarily apply for globular proteins. We show how the parameters for the “coarse-grained” description in terms of foldon spin variables can be extracted from more detailed folding simulations on perfectly funneled landscapes. To illustrate the ideas, we present a case-study of a family of tetratricopeptide (TPR) repeat proteins and quantitatively relate the results to the experimentally observed folding transitions. Based on the dramatic effect that single point mutations exert on the experimentally observed folding behavior, we speculate that natural repeat proteins are “poised” at particular ratios of inter- and intra-element interaction energetics that allow them to readily undergo structural transitions in physiologically relevant conditions, which may be intrinsically related to their biological functions.
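
The Ising-like picture in the abstract above can be illustrated with a toy calculation. The sketch below is not the authors' model: it assumes each repeat is a two-state unit (folded or unfolded), that a folded repeat contributes a hypothetical intrinsic free energy `intrinsic`, and that two adjacent folded repeats contribute a hypothetical interface energy `interface`, both in units of kT. It enumerates all states by brute force, which is only practical for short arrays.

```python
import itertools
import math

def folded_fraction(n_repeats, intrinsic, interface):
    """Return (partition function, mean fraction of folded repeats).

    Toy model (an assumption, not taken from the paper): each repeat is
    0 (unfolded) or 1 (folded); energies are given in units of kT.
    """
    partition = 0.0
    weighted_folded = 0.0
    for state in itertools.product((0, 1), repeat=n_repeats):
        # Intrinsic cost (or gain) of every folded repeat.
        energy = intrinsic * sum(state)
        # Interface term for each pair of adjacent folded repeats.
        energy += interface * sum(a * b for a, b in zip(state, state[1:]))
        weight = math.exp(-energy)
        partition += weight
        weighted_folded += weight * sum(state) / n_repeats
    return partition, weighted_folded / partition

# Hypothetical parameters: unfavorable repeats, favorable interfaces.
z, fraction = folded_fraction(n_repeats=4, intrinsic=1.0, interface=-3.0)
```

With a favorable (negative) interface term the array folds more readily than isolated repeats would, which is the qualitative coupling between stability and neighbor interactions that the abstract describes.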

Public Library of Science (PLOS)

Predicting active site residue annotations in the Pfam database

Author: A Ben-Shimon
A Gutteridge
AH Elcock
AH Liu
Alex Bateman
AR Panchenko
BM Beadle
CG Nevill-Manning
CH Wu
CT Porter
D La
EL Sonnhammer
H Yao
H Yao
I Letunic
Jaina Mistry
KC Chou
KM Mayer
M Ota
MJ Zvelebil
N Hulo
ND Rawlings
NJ Mulder
NV Petrova
O Lichtarge
P Aloy
P Puntervoll
PD Dobson
R Greaves
RD Finn
RD Finn
Robert D Finn
S Velankar
SR Eddy
W Tian
Publication venue: BioMed Central
Publication date: 01/01/2007

Abstract. Background: Approximately 5% of Pfam families are enzymatic, but only a small fraction of the sequences within these families (<0.5%) have had the residues responsible for catalysis determined. To increase the active site annotations in the Pfam database, we have developed a strict set of rules, chosen to reduce the rate of false positives, which enable the transfer of experimentally determined active site residue data to other sequences within the same Pfam family. Description: We have created a large database of predicted active site residues. On comparing our active site predictions to those found in UniProtKB, Catalytic Site Atlas, PROSITE and MEROPS, we find that we make many novel predictions. On investigating the small subset of predictions made by these databases that are not predicted by us, we found these sequences did not meet our strict criteria for prediction. We assessed the sensitivity and specificity of our methodology and estimate that only 3% of our predicted sequences are false positives. Conclusion: We have predicted 606110 active site residues, of which 94% are not found in UniProtKB, and have increased the active site annotations in Pfam by more than 200-fold. Although implemented for Pfam, the tool we have developed for transferring the data can be applied to any alignment with associated experimental active site data and is available for download. Our active site predictions are re-calculated at each Pfam release to ensure they are comprehensive and up to date. They provide one of the largest available databases of active site annotation.

Mining protein loops using a structural alphabet and statistical exceptionality

Author: A Dembo
A Efimov
A Golovin
A Sacan
A Via
AC Camproux
AC Camproux
AC Camproux
Anne-Claude Camproux
AR Panchenko
AR Panchenko
B Oliva
BJ Polacco
BL Sibanda
BL Sibanda
BL Sibanda
BW Matthews
C Kiss
CG Hunter
CM Venkatachalam
D Leader
D Stuart
DF Burke
E Rocha
EG Hutchinson
EJ Milner-White
EJ Milner-White
F den Hollander
G Ausiello
G Ausiello
G Nuel
G Nuel
G Nuel
G Pugalenthi
GD Rose
Gregory Nuel
J Espadaler
J Martin
J Martin
J van Helden
J Wojcik
JF Leszczynski
JM Kwasigroch
JS Fetrow
JS Richardson
Juliette Martin
JW Sammon
JW Torrance
KC Chou
L Regad
LE Donate
Leslie Regad
LN Johnson
LR Rabiner
LS Bernstein
M Hollander
M Mönnigmann
M Saraste
MY Leung
N Colloc'h
N Fernandez-Fuentes
N Fernandez-Fuentes
O Sander
P Fuchs
PA Rice
PN Lewis
R Kolodny
S Karlin
S Kim
S Kullback
S Sourice
SA Benner
SA Benner
SD Rufino
V Pavone
W Kabsch
W Li
W Li
WL DeLano
Publication venue: BioMed Central
Publication date: 01/01/2010

Abstract. Background: Protein loops encompass 50% of protein residues in available three-dimensional structures. These regions are often involved in protein functions, e.g. binding sites and catalytic pockets. However, describing protein loops with conventional tools is a difficult task. Regular secondary structures, helices and strands, have been widely studied, whereas loops, because they are highly variable in terms of sequence and structure, are difficult to analyze. Due to data sparsity, long loops have rarely been systematically studied. Results: We developed a simple and accurate method that allows the description and analysis of the structures of short and long loops using structural motifs without restriction on loop length. This method is based on the structural alphabet HMM-SA. HMM-SA allows the simplification of a three-dimensional protein structure into a one-dimensional string of states, where each state is a four-residue prototype fragment, called a structural letter. The difficult task of the structural grouping of huge data sets is thus easily accomplished by handling structural letter strings as in conventional protein sequence analysis. We systematically extracted all seven-residue fragments in a bank of 93000 protein loops and grouped them according to the structural-letter sequence, named a structural word. This approach permits a systematic analysis of loops of all sizes since we consider the structural motifs of seven residues rather than complete loops. We focused the analysis on highly recurrent words of loops (observed more than 30 times). Our study reveals that 73% of loop-lengths are covered by only 3310 highly recurrent structural words (out of 28274 observed words). These structural words have low structural variability (mean RMSd of 0.85 Å). As expected, half of these motifs display a flanking-region preference but, interestingly, two thirds are shared by short (less than 12 residues) and long loops. Moreover, half of the recurrent motifs exhibit a significant level of amino-acid conservation with at least four significant positions, and 87% of long loops contain at least one such word. We complement our analysis with the detection of statistically over-represented patterns of structural letters, as in conventional DNA sequence analysis. About 30% (930) of structural words are over-represented, and cover about 40% of loop lengths. Interestingly, these words exhibit lower structural variability and higher sequential specificity, suggesting structural or functional constraints. Conclusions: We developed a method to systematically decompose and study protein loops using recurrent structural motifs. This method is based on the structural alphabet HMM-SA and not on structural alignment and geometrical parameters. We extracted meaningful structural motifs that are found in both short and long loops. To our knowledge, this is the first time that pattern mining has helped to increase the signal-to-noise ratio in protein loops. This finding helps to better describe protein loops and might make it possible to reduce the complexity of long-loop analysis. Detailed results are available at http://www.mti.univ-paris-diderot.fr/publication/supplementary/2009/ACCLoop/.
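
The word-counting step described in the abstract above can be sketched as follows. This is an illustration only, not the authors' pipeline: it assumes loops are already encoded as strings of structural letters, and it assumes a structural word is four consecutive letters (four overlapping four-residue fragments, each shifted by one residue, span seven residues). The toy strings and function names are hypothetical.

```python
from collections import Counter

# Assumption: 4 overlapping four-residue letters cover 7 residues.
WORD_LENGTH = 4

def count_structural_words(loops, word_length=WORD_LENGTH):
    """Count every overlapping word of structural letters across all loops."""
    counts = Counter()
    for letters in loops:
        for start in range(len(letters) - word_length + 1):
            counts[letters[start:start + word_length]] += 1
    return counts

def recurrent_words(counts, min_occurrences):
    """Keep words observed more than min_occurrences times."""
    return {word: n for word, n in counts.items() if n > min_occurrences}

# Hypothetical loops encoded as structural-letter strings.
toy_loops = ["ABCDAB", "BCDA", "XABCD"]
counts = count_structural_words(toy_loops)
frequent = recurrent_words(counts, min_occurrences=1)
```

On real data the threshold would follow the abstract (words observed more than 30 times); the toy example uses a threshold of 1 only because it contains three loops.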

Public Library of Science (PLOS)

Beta-Strand Interfaces of Non-Dimeric Protein Oligomers Are Characterized by Scattered Charged Residue Patterns

Protein oligomers are formed either permanently, transiently or even by default. The protein chains are associated through intermolecular interactions constituting the protein interface. The protein interfaces of 40 soluble protein oligomers of stoichiometries above two are investigated using a quantitative and qualitative methodology, which analyzes the x-ray structures of the protein oligomers and considers their interfaces as interaction networks. The protein oligomers of the dataset share the same geometry of interface, made by the association of two individual β-strands (β-interfaces), but are otherwise unrelated. The results show that the β-interfaces are made of two interdigitated interaction networks. One of them involves interactions between main chain atoms (backbone network) while the other involves interactions between side chain and backbone atoms or between only side chain atoms (side chain network). Each one has its own characteristics, which can be associated with a distinct role. The secondary structure of the β-interfaces is implemented through the backbone networks, which are enriched with the hydrophobic amino acids favored in intramolecular β-sheets (MCWIV). The intermolecular specificity is provided by the side chain networks via positioning different types of charged residues at the extremities (arginine) and in the middle (glutamic acid and histidine) of the interface. Such charge distribution helps discriminate between sequences of intermolecular β-strands, of intramolecular β-strands and of β-strands forming β-amyloid fibers. This might open new avenues for drug design and the development of predictive tools. Moreover, the β-strands of the cholera toxin B subunit interface, when produced individually as synthetic peptides, are capable of inhibiting the assembly of the toxin into pentamers. Thus, their sequences contain the features necessary for β-interface formation. Such β-strands could be considered as ‘assemblons’, independent associating units, by analogy to the foldons (independent folding units). Such a property would be extremely valuable in terms of developing drugs that inhibit assembly.

CiteSeerX

Hal - Université Grenoble Alpes

HAL Descartes

HAL Université de Savoie

FigShare

How accurate and statistically robust are catalytic site predictions based on closeness centrality?

Author: A Armon
A del Sol
A Gutteridge
AG Murzin
AH Elcock
AR Panchenko
B Thibert
CA Innis
D La
D La
Dennis R Livesay
DR Livesay
DR Livesay
DR Livesay
Eric Chea
F Pazos
F Pazos
G Cheng
GJ Bartlett
GM Alter
H Yao
JD Watson
KC Usher
KV Brinda
LC Kurz
LH Greene
M Vendruscolo
MA del Sol
MJ Ondrechen
MT Neves-Petersen
NV Dokholyan
O Lichtarge
OS Soyer
P Aloy
PJ Bickel
PP Wangikar
R Landgraf
RJ Russell
S Jones
S Madabushi
W Kabsch
Publication venue: BioMed Central
Publication date: 01/01/2007

Abstract. Background: We examine the accuracy of enzyme catalytic residue predictions from a network representation of protein structure. In this model, amino acid α-carbons specify vertices within a graph and edges connect vertices that are proximal in structure. Closeness centrality, which has shown promise in previous investigations, is used to identify important positions within the network. Closeness centrality, a global measure of network centrality, is calculated as the reciprocal of the average distance between vertex i and all other vertices. Results: We benchmark the approach against 283 structurally unique proteins within the Catalytic Site Atlas. Our results, which are in line with previous investigations of smaller datasets, indicate closeness centrality predictions are statistically significant. However, unlike previous approaches, we specifically focus on residues with the very best scores. Over the top five closeness centrality scores, we observe an average true to false positive rate ratio of 6.8 to 1. As demonstrated previously, adding a solvent accessibility filter significantly improves predictive power; the average ratio is increased to 15.3 to 1. We also demonstrate (for the first time) that filtering the predictions by residue identity improves the results even more than accessibility filtering. Here, we simply eliminate from consideration residues with physicochemical properties unlikely to be compatible with catalytic requirements. Residue identity filtering improves the average true to false positive rate ratio to 26.3 to 1. Combining the two filters has little effect on the results. Calculated p-values for the three prediction schemes range from 2.7E-9 to less than 8.8E-134. Finally, the sensitivity of the predictions to structure choice and slight perturbations is examined. Conclusion: Our results resolutely confirm that closeness centrality is a viable prediction scheme whose predictions are statistically significant. Simple filtering schemes substantially improve the method's predictive power. Moreover, no clear effect on performance is observed when comparing ligated and unligated structures. Similarly, the CC prediction results are robust to slight structural perturbations from molecular dynamics simulation.
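
The closeness centrality definition quoted in the abstract above (the reciprocal of the average distance between a vertex and all other vertices) can be sketched in a few lines. This is a minimal illustration on an unweighted graph using breadth-first search; the toy adjacency list and function name are assumptions, building a contact graph from α-carbon coordinates is not shown, and the authors' implementation may differ.

```python
from collections import deque

def closeness_centrality(adjacency, source):
    """Reciprocal of the mean shortest-path distance from source to all others."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    others = len(adjacency) - 1
    # Convention chosen here: return 0.0 if some vertex is unreachable.
    if others == 0 or len(distances) - 1 < others:
        return 0.0
    return others / sum(distances.values())

# Hypothetical five-residue chain: 0 - 1 - 2 - 3 - 4.
contacts = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
scores = {vertex: closeness_centrality(contacts, vertex) for vertex in contacts}
top_ranked = max(scores, key=scores.get)
```

In this toy chain the middle vertex has the highest score, which matches the intuition that closeness centrality favors positions near the center of the network.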

Protein sequence alignment with family-specific amino acid similarity matrices

Author: A Agrawal
A Prlić
AR Panchenko
B Qian
B Rost
C Notredame
CB Do
CN Cavasotto
G Vogt
GH Gonnet
GP Raghava
I Van Walle
Igor B Kuznetsov
IN Shindyalov
J Pei
J Söding
JD Blake
JD Thompson
JM Sauder
JS Bernardes
K Mizuguchi
L Holm
L Lo Conte
ML Sierk
MO Dayhoff
MS Johnson
RB Vilim
RC Edgar
RC Edgar
RC Edgar
S Henikoff
S Salem
SB Needleman
SE Brenner
SF Altschul
SR Eddy
T Müller
TF Smith
V Ahola
WR Pearson
WR Taylor
Y Liu
Publication venue: BioMed Central
Publication date: 01/01/2011

Prediction of catalytic residues using Support Vector Machine with selected protein sequence and structural properties

Author: A Andreeva
A Gutteridge
AH Elcock
AR Panchenko
B Lee
B Rost
BW Mathews
CA Innis
Cathy H Wu
CH Wu
DK Smith
GJ Bartlett
H Yao
HM Berman
IH Witten
JC Platt
JD Thompson
JS Milton
K Kinoshita
K Sjolander
M Ota
MA Hearst
MJ Ondrechen
Natalia V Petrova
O Lichtarge
P Aloy
PP Wangikar
R Kohavi
R Koradi
R Landgraf
RL Tatusov
S Chakravarty
S Jones
S Parthasarathy
S Zhu
SF Altschul
SJ Campbell
SJ Hubbard
TA Binkowski
W Kabsch
W Tian
WSJ Valdar
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: The number of protein sequences deriving from genome sequencing projects is outpacing our knowledge about the function of these proteins. With the gap between experimentally characterized and uncharacterized proteins continuing to widen, it is necessary to develop new computational methods and tools for functional prediction. Knowledge of catalytic sites provides a valuable insight into protein function. Although many computational methods have been developed to predict catalytic residues and active sites, their accuracy remains low, with a significant number of false positives. In this paper, we present a novel method for the prediction of catalytic sites, using a carefully selected, supervised machine learning algorithm coupled with an optimal discriminative set of protein sequence conservation and structural properties. RESULTS: To determine the best machine learning algorithm, 26 classifiers in the WEKA software package were compared using a benchmarking dataset of 79 enzymes with 254 catalytic residues in a 10-fold cross-validation analysis. Each residue of the dataset was represented by a set of 24 residue properties previously shown to be of functional relevance, as well as a label {+1/-1} to indicate catalytic/non-catalytic residue. The best-performing algorithm was the Sequential Minimal Optimization (SMO) algorithm, which is a Support Vector Machine (SVM). The Wrapper Subset Selection algorithm further selected seven of the 24 attributes as an optimal subset of residue properties, with sequence conservation, catalytic propensities of amino acids, and relative position on protein surface being the most important features. CONCLUSION: The SMO algorithm with 7 selected attributes correctly predicted 228 of the 254 catalytic residues, with an overall predictive accuracy of more than 86%. 
Missing only 10.2% of the catalytic residues, the method captures the fundamental features of catalytic residues and can be used as a "catalytic residue filter" to facilitate experimental identification of catalytic residues for proteins with known structure but unknown function.
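
The residue counts reported in the abstract above can be checked with simple arithmetic. The sketch below uses only the two figures given in the text (228 of 254 catalytic residues predicted); the variable names are illustrative. The "more than 86%" overall accuracy also depends on the non-catalytic residues, whose counts are not given here, so it is not recomputed.

```python
# Counts taken from the abstract.
catalytic_total = 254
correctly_predicted = 228

# Residues the classifier failed to flag as catalytic.
missed = catalytic_total - correctly_predicted

# Sensitivity (recall) on catalytic residues and the complementary miss rate.
sensitivity = correctly_predicted / catalytic_total
missed_fraction = missed / catalytic_total

print(f"missed residues: {missed}")
print(f"sensitivity: {sensitivity:.1%}")
print(f"missed fraction: {missed_fraction:.1%}")
```

The miss rate works out to 26 of 254 residues, consistent with the 10.2% stated in the abstract.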